Artificial intelligence (AI) is the discipline focused on enabling computers to
operate autonomously without explicit programming. Within AI, computer
vision is an emerging field tasked with giving machines the ability
to interpret visual data from images and videos. Over recent decades,
computer vision has found applications in diverse fields such as autonomous
vehicles, information retrieval, surveillance, and understanding human
behavior. Object detection, a key task in computer vision, employs deep
neural networks to continually improve detection accuracy and speed. Its
goal is to precisely locate objects within images or videos and assign them
to specific classes. Object detection models typically consist of three
components: a backbone network for feature extraction, a neck for
feature aggregation, and a head for prediction. This study focuses on
two-stage detectors. It provides a comprehensive review of
two-stage detectors in object detection, followed by benchmarking to offer
insights for researchers and scientists. By analyzing the
efficacy of these models, this research seeks to guide future developments in
the field of object detection within computer vision.
1. INTRODUCTION

Object detection is often also called image detection, object identification, or object recognition, and all
these terms are used synonymously (Figure 1). It is a computer vision method for locating instances of objects in
an image or video sequence. Object detection algorithms typically rely on machine learning
or deep learning techniques to produce meaningful results. When humans look at images or videos,
they can locate and recognize objects of interest easily. The goal of object detection is to mimic this
ability using a computer. With recent advancements in deep learning-based computer vision models,
object detection use cases are spreading more than ever before. A wide range of applications has been implemented,
for instance, self-driving cars, object tracking, anomaly detection, and video surveillance.

The paper explores two-stage detector models, focusing on their relevance and advancements within
the field of object detection. The related works section reviews existing research to contextualize the
study. The background section details fundamental concepts, including deep neural networks and model architectures.
The comparison section evaluates various models based on performance metrics. The results section presents empirical findings,
while the discussion interprets them and their implications. Finally, the conclusion summarizes key findings and
suggests future research directions.
Figure 1. Comparison of visual recognition tasks in computer vision


2. RELATED WORKS 

Numerous research works have been carried out to develop and advance object detection
applications and systems, drawing on a wide range of methodologies from machine learning, deep learning,
and other approaches, and researchers continue to extend this body of work (Figure 2). One example is
feature aggregation methods, which connect low-level and high-level features for better object recognition
in images and video sequences. Feature aggregation is widely used in action recognition [1]-[5] and video
description [6], [7]. On the one hand, most of these methods use recurrent neural networks (RNNs) to aggregate
features from consecutive frames. On the other hand, exhaustive spatio-temporal convolution is used to extract
spatio-temporal features. U-Net [8] was proposed to concatenate features from low level to high level for
medical image segmentation, and it achieved great success in that field. To obtain strong features for object
detection, feature pyramid networks (FPN) aggregate the transformed features from the bottom-up weighted
pyramid and the top-down lateral convolutions through a simple sum operation. Building on feature pyramid
networks, several extensions [9]-[12] define new options for connectivity between scales. Attention-based
models have also proven their efficiency in several deep learning applications [13]-[18]. Self-attention models by measuring
and apply. The unified architecture of two-stage detector methodologies (Figure 3) typically consists of three
main components: a backbone network, a proposal generation stage, and a refinement stage. These
methodologies are commonly used in object detection tasks to localize and classify objects in an image.
Figure 2. Taxonomy of two-stage detector models based on deep learning
Figure 3. The unified architecture of two-stage detector methodologies

3. BACKGROUND

Faster region-based convolutional neural network (faster R-CNN) [19] is, as its name suggests,
an extension of fast R-CNN that runs faster than its predecessor, thanks to the
region proposal network (RPN). The RPN is a fully convolutional network
responsible for generating proposals with different scales and aspect ratios.
Rather than using pyramids of filters, the authors introduce the concept of anchor boxes.
An anchor box is a reference box with a specific scale and aspect ratio (Figure 4).
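
The anchor box idea can be illustrated with a small sketch. It is not the authors' implementation: the function names, the base size of 16, and the three scales and three aspect ratios are illustrative assumptions chosen to give nine reference boxes per sliding-window position.

```python
import math


def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return one (x1, y1, x2, y2) anchor per (ratio, scale) pair, centred at (0, 0).

    `ratio` is height / width. All default values are illustrative assumptions.
    """
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = float(base_size * scale) ** 2  # target anchor area in pixels
            width = math.sqrt(area / ratio)
            height = width * ratio
            anchors.append((-width / 2, -height / 2, width / 2, height / 2))
    return anchors


def shift_anchors(anchors, cx, cy):
    """Translate origin-centred anchors to a sliding-window position (cx, cy)."""
    return [(x1 + cx, y1 + cy, x2 + cx, y2 + cy) for x1, y1, x2, y2 in anchors]


anchors = make_anchors()
print(len(anchors))  # one anchor for each of the 3 ratios x 3 scales
print(shift_anchors(anchors, 100, 60)[3])  # square anchor moved to (100, 60)
```

In this sketch every position on the feature map reuses the same set of reference boxes, so proposals at several scales and aspect ratios come from a single pass over one feature map.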
Figure 4. Faster R-CNN architecture


CoupleNet [20] is an object detector that takes the object proposals generated by the RPN and
feeds them into a coupling module that combines two branches. The first branch captures the local part
features of an object using position-sensitive RoI (PSRoI) pooling, and the other branch encodes the context and
global features using RoI pooling. ResNet-101 is used as the backbone, with its average pooling and FC layers
removed. Each proposal is then fed into two branches, a global fully convolutional network (FCN) and a
local FCN. Finally, the outputs of the local and global FCNs are combined to produce the final result (Figure 5).
Figure 5. CoupleNet architecture


Fast R-CNN [21] is, as its name suggests, an extension of R-CNN that overcomes
several of its issues and runs faster. Fast R-CNN introduces a
layer called region of interest (RoI) pooling, which extracts a fixed-length feature vector from each proposal. Compared
to the R-CNN model, which covers multiple stages, starting from region proposal generation, then feature
extraction, and finally classification using a support vector machine (SVM), fast R-CNN uses a single neural
network trained in a single stage. Fast R-CNN shares convolutional layer computations across all
proposals. The RoI pooling layer makes fast R-CNN faster and more accurate than
R-CNN. Fast R-CNN also does not cache extracted features, which reduces disk storage use
compared to R-CNN (Figure 6).
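
As a rough illustration of what RoI pooling does, the sketch below max-pools an arbitrary rectangular region of a single-channel feature map into a fixed grid. The function names, the 2 x 2 output size, and the floor/ceil bin boundaries are assumptions made for this example; actual implementations operate on multi-channel tensors and differ in how they quantize bin edges.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a 2-D feature map into a fixed output_size grid.

    roi = (x1, y1, x2, y2) uses inclusive cell indices on the feature map.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Rows covered by output row i (floor start, ceil end keeps bins non-empty).
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
print(roi_max_pool(feature_map, (0, 0, 3, 3)))  # [[5, 7], [13, 15]]
print(roi_max_pool(feature_map, (1, 1, 3, 3)))  # [[10, 11], [14, 15]]
```

Both regions, although of different sizes, yield a 2 x 2 output, which is what allows proposals of any shape to feed the same fully connected layers.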

SPP-Net [1] is a convolutional neural network (CNN) model that uses spatial pyramid
pooling (SPP), which removes the fixed-size input constraint of the network. An SPP layer is added on top of the
last convolutional layer. It pools the features and generates fixed-length outputs that are fed into the fully connected layers.
To avoid the need for warping or cropping, information aggregation is performed at a deeper stage of
the neural network (Figure 7).
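
The fixed-length property of spatial pyramid pooling can be sketched as follows. This is a simplified single-channel illustration, not the SPP-Net code: the pyramid levels (1, 2, 4) and the helper names are assumptions chosen for the example.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map over an n x n grid for each n in `levels`.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0, r1 = (i * height) // n, ceil_div((i + 1) * height, n)
            for j in range(n):
                c0, c1 = (j * width) // n, ceil_div((j + 1) * width, n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1) for c in range(c0, c1)))
    return pooled


small = [[r * 7 + c for c in range(7)] for r in range(5)]    # 5 x 7 map
large = [[r * 9 + c for c in range(9)] for r in range(13)]   # 13 x 9 map
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))  # 21 21
```

With levels (1, 2, 4) the output always has 1 + 4 + 16 = 21 values per channel, so inputs of different sizes produce vectors of the same length.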
Figure 6. Fast R-CNN architecture
Figure 7. SPP-Net architecture


RepPoints detector (RPDet) [2] is an anchor-free two-stage detector based on deformable
CNNs. RepPoints, a set of sample points, is used as the basic object representation throughout
the object detection system. The initial RepPoints are obtained by regressing offsets from center
points. The learning of these RepPoints is driven by two objectives: the object recognition loss of
the subsequent stage, and the distance loss between the top-left and bottom-right points of the induced
pseudo box and those of the ground-truth box (Figure 8).
Figure 8. RepPoints architecture


Libra R-CNN [3] is an object detector that aims for a balanced training process, unlike earlier
detectors, which suffer from imbalance at three
levels: the sample level, the feature level, and the objective level. Libra R-CNN integrates three corresponding components:
IoU-balanced sampling, a balanced feature pyramid, and a balanced L1 loss, which reduce the imbalance at the
sample, feature, and objective levels, respectively (Figure 9).
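
IoU-balanced sampling depends on the intersection over union (IoU) between candidate boxes and ground-truth boxes. The sketch below shows the standard IoU computation for axis-aligned boxes, followed by a hypothetical helper that groups candidates into equal-width IoU bins so that samples can be drawn evenly from each bin. The number of bins, the 0.5 upper bound, and the function names are assumptions for illustration, not values taken from the Libra R-CNN paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def bin_by_iou(ious, num_bins=3, max_iou=0.5):
    """Group candidate indices into equal-width IoU bins over [0, max_iou)."""
    bins = [[] for _ in range(num_bins)]
    width = max_iou / num_bins
    for index, value in enumerate(ious):
        if 0.0 <= value < max_iou:
            bins[min(int(value / width), num_bins - 1)].append(index)
    return bins


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))    # 25 / 175, about 0.143
print(bin_by_iou([0.05, 0.2, 0.4, 0.6]))      # [[0], [1], [2]]
```

Sampling the same number of candidates from each bin, rather than uniformly from the whole pool, is one way to keep higher-IoU (harder) negatives from being swamped by the many low-IoU ones.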
Figure 9. Libra R-CNN architecture

Multi-region CNN (MR-CNN) [4] is an object representation that uses several regions to capture
various aspects of an object. In the first stage, the image is passed through an activation map module
to obtain an activation map. The bounding box candidates, or region proposals, are generated
using selective search. VGG-16 is used as the backbone, with its last max pooling layer
removed (Figure 10).
Figure 10. MR-CNN architecture


DeepID-Net [5] is a multi-stage object detector based on deformable CNNs that
introduces innovations in several aspects. In the proposed architecture, a new pooling layer structure
is introduced. Multiple classifiers are integrated and optimized at several stages and levels. A new
training approach is also defined to learn deep feature representations with strong
generalization capability and good object detection results. The modeling is improved through
several techniques, including changing the network structure and the training strategies, and
adding some stages to the detection pipeline while removing others, which yields a significant diversity of
well-performing models (Figure 11).
Figure 11. DeepID-Net architecture


Region-based fully convolutional network (R-FCN) [6] is a region-based object detector. The
model is fully convolutional and shares computation over the entire image, in contrast to
region-based detectors such as fast/faster R-CNN, which apply a per-region subnetwork
many times. To achieve this, R-FCN uses position-sensitive score maps to address the dilemma between translation
variance in object detection and translation invariance in image classification (Figure 12).
Figure 12. R-FCN architecture


Grid R-CNN [7] is an object detection model in which grid point-guided localization replaces the
traditional regression formulation. The model divides the bounding box region into a grid and then
applies an FCN to predict the locations of the grid points. Thanks to the fully convolutional architecture,
Grid R-CNN captures explicit spatial information, and grid point locations can be obtained at the pixel level.
Based on the grid points, the model can determine accurate bounding boxes, and this grid-guided approach can
outperform regression methods (Figure 13).
Figure 13. Grid R-CNN architecture


TridentNet [8] is an object detection model designed to tackle the scale variation problem. Neither
one-stage nor two-stage detectors handle scale variation well. There are
different ways to address the issue, but they tend to increase inference time (Figure 14).
Figure 14. TridentNet architecture


Inside-outside net (ION) [9] is an object detection model that exploits information both inside and
outside the region of interest. Information outside the region of interest is
integrated using spatial recurrent neural networks (RNNs). Inside, skip pooling is used to extract and
gather features at multiple scales and levels of abstraction (Figure 15).
Figure 15. ION architecture


Gated bi-directional CNN (GBD-Net) [10] is an object detection model built on the fast
R-CNN pipeline. It uses a bi-directional CNN structure to pass messages among features from different
regions during both feature learning and feature extraction. This message passing is
implemented through convolution between neighboring regions in two directions and can be conducted in several
layers. Thus, local and contextual patterns can validate each other's existence by learning their nonlinear
relationships and complex interactions. The model also shows that message passing is
not always helpful, so the transmitted messages are controlled by gate functions. The model was evaluated with a
set of backbones, for instance ResNet and Inception, as feature extractors (Figure 16).
Figure 16. GBD-Net architecture


Mask R-CNN [11] is a model for both object detection and instance
segmentation that generates a segmentation mask for each instance. It extends faster
R-CNN by adding a branch for predicting object masks. Mask R-CNN is simple to
implement and adds only a small overhead to faster R-CNN (Figure 17).

Light-head R-CNN [12] addresses the heavy-head design of two-stage object detectors, which makes them
slow in comparison with one-stage detectors. Light-head R-CNN tackles this shortcoming
by making the head as light as possible, using a thin feature map and a cheap
R-CNN subnet that combines pooling and a fully connected layer. ResNet-101 is used as the backbone (Figure 18).

Structure inference network (SIN) [13] combines faster R-CNN with a graphical model that
infers object states. SIN takes into account not only visual appearance but
also scene information and the interactions between objects. The model formulates object
detection as a graph structure inference problem: objects are treated as nodes in a graph, and the relationships
between them are modeled as edges (Figure 19).
Figure 17. Mask R-CNN architecture


Figure 18. Light-head R-CNN architecture


Figure 19. SIN architecture


Detection transformer (DETR) [14] removes the need for hand-designed components such as non-maximum
suppression and anchor generation, which encode prior knowledge about the task. The main ingredients of the model are a set-based
global loss that forces unique predictions and a transformer encoder-decoder architecture. DETR
reasons about the relations between the objects and the global image context to output the final set of predictions.
DETR is conceptually simple and can be easily extended to perform panoptic segmentation (Figure 20).
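
For context, the non-maximum suppression step that DETR dispenses with can be sketched as a greedy filter over scored boxes. This is a generic illustration rather than code from any of the cited detectors; the 0.5 IoU threshold and the function names are assumptions.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```

A detector trained with a loss that forces unique predictions has no duplicate boxes to prune, which is why this post-processing step becomes unnecessary.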

HyperNet [15] is an object detection model that addresses the use of region proposals to guide the
search for object instances. To achieve high recall, region proposal methods need a very large number of proposals, which hurts
detection efficiency, and these methods still struggle with small objects. HyperNet
handles region proposal generation and object detection jointly. It aggregates feature
maps and then compresses them into a uniform space (Figure 21).

Multi-scale CNN (MS-CNN) [16] is a unified deep neural network proposed for fast
multi-scale object detection. MS-CNN consists of two sub-networks: a proposal sub-network and a detection sub-network. In the
proposal sub-network, detection is performed at multiple output layers, so that receptive fields match
objects of different scales. These scale-specific detectors are combined to produce a multi-scale object detection model. Feature
upsampling by deconvolution is used to reduce memory and computation costs (Figure 22).
4. COMPARISON


Table 1 lists well-known two-stage object detection models trained on the common
objects in context (COCO) dataset, together with their results on the different evaluation metrics (Figure 23). For each
model, Table 1 gives the backbone used for feature extraction and, where applicable, the model
used for feature fusion, along with the evaluation metrics, the title of the paper, and its
publication year.


Average precision (AP):
AP: AP at IoU=.50:.05:.95 (primary challenge metric)
AP (IoU=.50): AP at IoU=.50 (PASCAL VOC metric)
AP (IoU=.75): AP at IoU=.75 (strict metric)

AP across scales:
AP small: AP for small objects: area < 32²
AP medium: AP for medium objects: 32² < area < 96²
AP large: AP for large objects: area > 96²

Source: https://cocodataset.org/#detection-eval


Figure 23. Evaluation metrics
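
The definitions in Figure 23 can be made concrete with a short sketch. The ten IoU thresholds and the 32² and 96² area limits come from the metric definitions above; the function names are hypothetical, and assigning areas exactly equal to a limit to the larger bucket is an assumption, since the definitions only state strict inequalities.

```python
# IoU=.50:.05:.95 expands to ten thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def primary_ap(ap_per_threshold):
    """Average the per-threshold AP values into the primary COCO AP."""
    if len(ap_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)


def size_category(area):
    """Map an object area in pixels to the scale bucket used by APS, APM, APL."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:  # boundary handling is an assumption of this sketch
        return "medium"
    return "large"


print(IOU_THRESHOLDS)
print(size_category(20 * 20), size_category(50 * 50), size_category(120 * 120))
# small medium large
```

AP50 and AP75 correspond to evaluating at the single thresholds 0.50 and 0.75, while the primary AP averages over all ten.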


Bulletin of Electr Eng & Inf, Vol. 13, No. 3, June 2024: 1694-1706 


Bulletin of Electr Eng & Inf ISSN: 2302-9285 O 1703 


Table 1. Different two-stage detector models compared on different evaluation metrics

| Model | Backbone | AP | AP50 | AP75 | APS | APM | APL | Paper | Year |
|---|---|---|---|---|---|---|---|---|---|
| Deformable DETR | ResNeXt-101+DCN | 52.3 | 71.9 | 58.1 | 34.4 | 54.4 | 65.6 | Deformable DETR: Deformable Transformers for End-to-End Object Detection [17] | 2021 |
| RepPoints v2 | ResNeXt-101, DCN, multi-scale | 52.1 | 70.1 | 57.5 | 34.5 | 54.6 | 63.6 | RepPoints V2: Verification Meets Regression for Object Detection [19] | 2020 |
| TridentNet | ResNet-101-Deformable, image pyramid | 48.4 | 69.7 | 53.5 | 31.8 | 51.3 | 60.3 | Scale-Aware Trident Networks for Object Detection [8] | 2019 |
| PANet | ResNeXt-101, multi-scale | 47.4 | 67.2 | 51.8 | 30.1 | 51.7 | 60.0 | Path Aggregation Network for Instance Segmentation [19] | 2018 |
| SNIPER | ResNet-101 | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1 | SNIPER: Efficient Multi-Scale Training [20] | 2018 |
| Mask R-CNN | HRNetV2p-W48+cascade | 46.1 | 64.0 | 50.3 | 27.1 | 48.6 | 58.3 | Deep High-Resolution Representation Learning for Visual Recognition [21] | 2021 |
| Faster R-CNN | LIP-ResNet-101-MD w FPN | 43.9 | 65.7 | 48.1 | 25.4 | 46.7 | 56.3 | LIP: Local Importance-based Pooling [22] | 2019 |
| SNIPER | ResNet-50 | 43.5 | 65.0 | 48.6 | 26.1 | 46.3 | 56.0 | SNIPER: Efficient Multi-Scale Training [20] | 2018 |
| D-RFCN+SNIP | ResNet-101, multi-scale | 43.4 | 65.5 | 48.4 | 27.2 | 46.5 | 54.9 | An Analysis of Scale Invariance in Object Detection - SNIP [23] | 2018 |
| Grid R-CNN | ResNeXt-101-FPN | 43.2 | 63.0 | 46.6 | 25.1 | 46.5 | 55.2 | Grid R-CNN [7] | 2019 |
| Libra R-CNN | ResNeXt-101-FPN | 43.0 | 64 | 47 | 25.3 | 45.6 | 54.6 | Libra R-CNN: Towards Balanced Learning for Object Detection [3] | 2019 |
| TridentNet | ResNet-101 | 42.7 | 63.6 | 46.5 | 23.9 | 46.6 | 56.6 | Scale-Aware Trident Networks for Object Detection [8] | 2019 |
| Faster R-CNN | HRNetV2p-W48 | 42.4 | 63.6 | 46.4 | 24.9 | 44.6 | 53.0 | Deep High-Resolution Representation Learning for Visual Recognition [21] | 2021 |
| RPDet | ResNet-101 | 41 | 62.9 | 44.3 | 23.6 | 44.1 | 51.7 | RepPoints: Point Set Representation for Object Detection [2] | 2019 |
| Mask R-CNN | ResNet-101-FPN, CBN | 40.1 | 60.5 | 44.1 | 35.8 | 57.3 | 38.5 | Cross-Iteration Batch Normalization [24] | 2021 |
| Fast R-CNN | Cascade RPN | 40.1 | 59.4 | 43.8 | 22.4 | 42.4 | 51.6 | Cascade RPN: Delving into High-Quality Region Proposal Network with Adaptive Convolution [25] | 2019 |
| ION | ResNeXt-101+DCN | 33.1 | 55.7 | 34.6 | 14.5 | 35.2 | 47.2 | Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks [9] | 2016 |


5. RESULTS


Plotting the models listed in Table 1 gives Figure 24,
which highlights the strength of the newer models, such as deformable DETR based on
ResNeXt-101+DCN and RepPoints v2, which also relies on ResNeXt-101 with DCN.


5.1. Based on average precision 

Box average precision (AP) is the AP at IoU=.50:.05:.95 (primary challenge metric). In terms of AP,
deformable DETR, which combines ResNeXt-101 and DCN as the backbone, reaches the highest score of 52.3.
RepPoints v2, which relies on ResNeXt-101, DCN, and multi-scale as the backbone, takes second
place with a difference of 0.2 (Figure 25).

AP50 is the AP at IoU=.50 (PASCAL VOC metric). As for the AP metric, deformable
DETR takes first place in terms of AP50 with ResNeXt-101 and DCN as the backbone, and
RepPoints v2 takes second place with a difference of 1.8 (Figure 25).

AP75 is the AP at IoU=.75 (strict metric). Regarding AP75, deformable DETR with the same
backbone as for AP and AP50 takes first place with 58.1. Second place is taken
by RepPoints v2 with a difference of 0.6 (Figure 25).
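
The gaps between the two leading models can be recomputed directly from the Table 1 values. The snippet below is a small arithmetic check; the dictionary layout and the function name are illustrative choices, while the scores are copied from Table 1.

```python
# AP, AP50, and AP75 scores of the two leading models, copied from Table 1.
TOP_TWO = {
    "Deformable DETR": {"AP": 52.3, "AP50": 71.9, "AP75": 58.1},
    "RepPoints v2": {"AP": 52.1, "AP50": 70.1, "AP75": 57.5},
}


def gap(metric):
    """Difference between the first and second model, rounded to one decimal."""
    first = TOP_TWO["Deformable DETR"][metric]
    second = TOP_TWO["RepPoints v2"][metric]
    return round(first - second, 1)


for metric in ("AP", "AP50", "AP75"):
    print(metric, gap(metric))
# AP 0.2
# AP50 1.8
# AP75 0.6
```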
Figure 24. Two-stage detectors comparison plot based on different evaluation metrics
Figure 25. Two-stage detectors comparison based on AP, AP50, and AP75


5.2. Based on AP across scales 


Average precision small (APS) is the AP for small objects: area < 32². This is the first metric for
which a model other than deformable DETR reaches the highest score:
Mask R-CNN with ResNet-101-FPN and CBN as the backbone and feature aggregation model. RepPoints
v2 keeps second place, 0.1 ahead of deformable DETR (Figure 26). Average precision medium (APM) is the AP for
medium objects: 32² < area < 96². Regarding the APM metric, Mask R-CNN with the same
configuration takes first place, and RepPoints v2 again keeps second place,
with a difference of 2.7. Average precision large (APL) is the AP for large objects: area > 96²
(Figure 26). Deformable DETR with ResNeXt-101 as the backbone takes first place with 65.6, and
RepPoints v2 with ResNeXt-101 as the backbone reaches 63.6, a difference of 2.0 (Figure 26).
Figure 26. Two-stage detectors comparison based on APL, APM, APS




6. DISCUSSION 

We presented the most widespread two-stage object detection methods. This paper covers several
notable models. We started by listing the major branches of computer vision: image
classification, object detection, semantic segmentation, and instance segmentation. After clarifying the main
differences among these branches, we presented the related work. While gathering
the related articles, we found no studies that focus on a single
category of object detection models, as we do here by concentrating on two-stage object detection models.

In the third stage, we analyzed each two-stage detector model separately, presenting its main
architecture together with a detailed description of its main components, from the input to the
backbone, then the neck, and finally the head. These are the components most commonly
combined. After discussing and analyzing the models' architectures, we turned
to a benchmark table that combines the numerous models with their evaluation metric scores,
both AP-based (box AP, AP50, and AP75) and across scales (APS, APM, and APL). All the
cited models are evaluated on the COCO dataset.

After compiling the detailed benchmark, we visualized the results for the AP metrics and the across-scale
metrics separately and analyzed them using the resulting plots. From
the best results, we conclude that deformable DETR, which relies on ResNeXt-101 and DCN as the
backbone, and RepPoints v2, which is based on the same backbone, lead the listed models on most
metrics. The exception is Mask R-CNN, based on ResNet-101 for feature extraction and FPN
for feature fusion, which performs better in terms of APS and APM. These results indicate that
ResNeXt-101 combined with DCN is a strong choice in terms of AP, while on the small and medium across-scale metrics
Mask R-CNN performs better, which highlights the strength of the ResNet-101-FPN-CBN architecture. These
results will help in constructing new models inspired by the best of the cited
models, for example by using ResNeXt-101, to achieve better performance.


7. CONCLUSION 

In conclusion, this paper has presented an extensive overview of two-stage object detection 
methods, covering various models within the realm of computer vision. Beginning with an exploration of 
major branches such as image classification, object detection, semantic segmentation, and instance 
segmentation, we proceeded to delve into a comprehensive review of related works, focusing specifically on 
two-stage object detection models. Through meticulous analysis, each two-stage detector model was 
dissected, elucidating their architectures and key components, including input, backbone, neck, and head 
models. These components represent the fundamental building blocks shared across most models. 

Furthermore, we conducted a detailed benchmarking, evaluating the performance of these models on
the COCO dataset using metrics such as AP and the across-scale metrics APS, APM, and APL. Visualization of the results
facilitated a nuanced discussion, revealing standout performers such as deformable DETR and RepPoints v2,
both leveraging ResNeXt-101 with DCN as a backbone. Notably, Mask R-CNN, utilizing ResNet-101 and
FPN, demonstrated superior performance in terms of APS and APM. This underscores the efficacy of the
ResNet-101-FPN-CBN architecture for across-scale tasks. These findings serve as a valuable resource for
informing the development of future models, emphasizing the potential benefits of architectures such as
ResNeXt-101 with DCN. By leveraging insights from top-performing models, we aim to enhance the
performance of subsequent iterations and push the boundaries of object detection capabilities.